Playing 2048

a reinforcement learning approach

Isac Pasianotto, Yuri Paglierani

7/21/23

What is 2048?

  • Single-player game
  • 4x4 grid populated by tiles (powers of 2)
  • The player can move the tiles up, down, left, or right
  • All tiles slide in the chosen direction as far as possible; tiles with the same value merge into one tile holding their sum (e.g., moving left on the row [2, 2, 4, 0] yields [4, 4, 0, 0])

Example of a 2048 board

Model of the environment

We created a class from scratch that represents the environment, extending the gymnasium.Env class:
import gymnasium as gym
import numpy as np

class Game2048Env(gym.Env):
    def __init__(self, log_rewards=False):
        ...
    def reset(self):
        ...
    def get_legit_actions(self):
        ...
    def _update_board(self, delta):
        ...
    def _UP(self):      # _UP, _DOWN, _LEFT, _RIGHT
        ...
    def _can_UP(self):  # _can_UP, _can_DOWN, _can_LEFT, _can_RIGHT
        ...
    def _is_changed(self, board):
        ...
    def step(self, action):
        ...
    def _add_random_tile(self):
        ...
    def _is_full_board(self):
        ...
    def _is_game_over(self):
        ...
  • The board in the environment is represented as a numpy array of shape (4, 4) of integers (powers of 2), with 0 representing an empty tile.
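For illustration, a board with a few tiles could look like this (the specific values are just an example):

import numpy as np

# An example board state: 0 marks an empty cell
board = np.array([[2, 0, 0, 4],
                  [0, 0, 2, 8],
                  [0, 0, 0, 2],
                  [0, 0, 0, 0]])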
The agent will interact with the environment through the step(action) function:
def step(self, action, verbose=False):
        """Performs the action passed as argument, checks if the game is over
        and, if it is not, adds a random tile to the board and updates the
        environment attributes. The reward is the increase of the score.

        Args:
            action (int): The action to perform. 0: UP, 1: DOWN, 2: LEFT, 3: RIGHT
            verbose (bool, optional): If True, prints relevant information about the step taken (debugging purposes). Defaults to False.

        Returns:
            self.board (np.array((4,4))): The new board after the action is performed
            delta (int): The increase of the score after the action is performed
            done (bool): True if the game is over, False otherwise
            won (bool): True if the game is won (2048 reached), False otherwise
            {}: Empty dictionary (required by the gym.Env class)
        """

        delta = 0
        board = self.board.copy()
        if action == 0:
            delta = self._UP()
        elif action == 1:
            delta = self._DOWN()
        elif action == 2:
            delta = self._LEFT()
        elif action == 3:
            delta = self._RIGHT()
        if self._is_changed(board):
            self._add_random_tile()
            self.legit_actions = self.get_legit_actions()
        else:
            print("Invalid action taken!")

        # Check if the game is over (no more valid moves or 2048 tile reached)
        done, won = self._is_game_over()
        if won:
            # bonus for winning: log2(2048) = 11 when using log-rewards
            if self.log_rewards:
                delta += 11
            else:
                delta += 2048
        if verbose:
            # prints some information about the step taken
            ...

        return self.board, delta, done, won, {}
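For context, a random-play loop against this environment might look as follows (a minimal sketch; it assumes reset() returns the initial board and get_legit_actions() returns the indices of the legal moves):

env = Game2048Env()
board = env.reset()
done = False
while not done:
    # pick uniformly among the currently legal moves
    action = np.random.choice(env.get_legit_actions())
    board, reward, done, won, _ = env.step(action)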
Example of an action called by the step(action) function:
def _UP(self):
        """
        Performs the action "UP": all the tiles are moved to the uppermost position
        they can reach. If two tiles with the same value are adjacent in the same
        column, they are merged and the score is updated.

        Returns:
            int: The sum of the values (or their log2 values if self.log_rewards is True)
            of the tiles generated by performing the action
        """

        delta = 0
        for i in range(4):
            col = self.board[:, i]
            col = col[col != 0]                                  # drop empty cells
            col = np.concatenate((col, np.zeros(4 - len(col))))  # pad with zeros at the bottom
            for j in range(3):
                if col[j] != 0 and col[j] == col[j+1]:           # merge equal adjacent tiles
                    col[j] = col[j] * 2
                    self._update_score(col[j])
                    if self.log_rewards:
                        delta += np.log2(col[j])
                    else:
                        delta += col[j]
                    col[j+1] = 0
            col = col[col != 0]                                  # compact again after merging
            col = np.concatenate((col, np.zeros(4 - len(col))))
            self.board[:, i] = col
        return delta

    def _can_UP(self):
        """
        Checks if the action "UP" would change the board.

        Returns:
            bool: True if "UP" is a legit action, False otherwise
        """

        for i in range(4):
            col = self.board[:, i]
            for j in range(1, 4):
                if col[j] != 0:
                    # a tile can move up if the cell above is empty or holds an equal tile
                    if col[j - 1] == 0 or col[j - 1] == col[j]:
                        return True
        return False

Reinforcement Learning Framework

Action space

  • The action space \(\mathcal{A}\) is discrete and finite
  • However, in our implementation it is not constant, since the set of legal actions changes at each step (see the sketch below)

\[ A(s) \subseteq \mathcal{A} = \{\text{up},\ \text{down},\ \text{left},\ \text{right}\} \qquad \forall s \in S \]

  • All actions are deterministic \(\Rightarrow \Pr(\text{execution}=\text{action})=1\)
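The slides elide the body of get_legit_actions; a minimal sketch consistent with the _can_* helpers shown earlier could be:

def get_legit_actions(self):
    # action indices follow step(): 0 = UP, 1 = DOWN, 2 = LEFT, 3 = RIGHT
    checks = [self._can_UP, self._can_DOWN, self._can_LEFT, self._can_RIGHT]
    return [a for a, can in enumerate(checks) if can()]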

State space

  • The state space \(S\) is discrete and finite
  • The cardinality \(|S|\) is not trivial to compute, but it is upper bounded by \(12^{16}\): each of the 16 cells is either empty or holds one of the 11 values \(2, 4, \dots, 2048\)
  • This bound includes some impossible states, but it is good enough to give an idea of how huge \(|S|\) is
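For scale:

\[ 12^{16} \approx 1.8 \times 10^{17} \]

far more states than any table could hold.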

Reward

In the original game, the score is incremented by the value of each tile created when two tiles are merged by a move.

This is a good starting point, but we made various attempts at reward shaping to improve the learning process:

  • taking the \(\log_{2}(\cdot)\) of the score increment, in order to homogenize the reward scale
  • adding a penalization at each step: ideally, this should encourage the agent to merge tiles as soon as possible

These combinations of reward shaping are controlled by the log_rewards parameter of the environment and the penalize parameter of the agent, as sketched below.
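A minimal sketch of how the two shaping options combine (merged_value and step_penalty are hypothetical names; the slides do not show the exact implementation):

import numpy as np

def shaped_reward(merged_value, log_rewards=False, penalize=False, step_penalty=1.0):
    # merged_value: sum of the tiles created by the move (0 if nothing merged)
    reward = np.log2(merged_value) if (log_rewards and merged_value > 0) else merged_value
    if penalize:
        reward -= step_penalty  # constant cost paid at every step (illustrative value)
    return reward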

Solving the problem

Model-based or model-free?

  • Even if technically we could find the transition probabilities \(p(s'|s,a)\), in a model-based approach we would have to compute the expected state-action value:

\[ Q_{\pi}(s,a) = \sum_{s'}p(s'|s,a)\left[r(s,a,s') + \gamma V_{\pi}(s')\right] \]

which is not feasible given the size of the state space.

\(\Rightarrow\) we have to use a model-free approach, learning directly from sampled trajectories:

\(S_0,\ A_0,\ R_1,\ S_1,\ A_1,\ R_2,\ \dots\)

TD-learning

We used an action-value method, Temporal-Difference learning, which is a model-free approach to estimate the optimal action-value function \(Q^{*}(s,a)\):

\[Q^{*}(s,a) = \max_{\pi}{\mathbb{E}_{\pi}\left[R_t \mid s_t = s,\ a_t = a \right]}\]

There are many possible variations of the TD-learning algorithm (e.g. Monte Carlo, TD(0), TD(\(\lambda\)), …).

We decided to use the TD(0) algorithm: even though it is the simplest one, it is also the cheapest in terms of computational cost.

TD-learning

If we compute the TD error as \(\delta = R + \gamma\max_{a'}Q(S',a') - Q(S,A)\), we obtain what is called the Q-learning algorithm.
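In tabular form the corresponding update is a one-liner (a sketch; alpha and gamma are the usual learning-rate and discount hyperparameters):

# Q-learning update for a transition (s, a, r, s_next)
td_error = r + gamma * Q[s_next].max() - Q[s, a]
Q[s, a] += alpha * td_error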

DQN - Q-learning with Neural Networks

Why DQN?

  • In the original Q-learning algorithm, the action-value function is represented as a table updated at each step:

\(Q : S \times A \to \mathbb{R}\)

  • However, the state space is huge (upper bounded by \(12^{16}\)), which makes computing the Q-table infeasible in terms of both memory and computational cost \(\to\) curse of dimensionality

Solution: use a function approximator to represent the Q-function

  • Deep Q-Network: Q-learning with Neural Networks, \(Q : S \to \mathbb{R}^{|A|}\) (see the sketch below)
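The slides do not show the network architecture; a minimal PyTorch sketch of a network mapping the 4x4 board to one Q-value per action (the layer sizes are illustrative) could be:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                   # 4x4 board -> 16 features
            nn.Linear(16, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),      # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)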

DQN & Q-learning

The action \(a\) is then chosen greedily as \(\arg\max_{a}Q(s,a)\)
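In code, greedy action selection over the network output looks like this (a sketch using the DQN class above; policy_net is an instance of it and state a float tensor of shape (1, 4, 4)):

with torch.no_grad():
    q_values = policy_net(state)            # shape: (1, n_actions)
    action = q_values.argmax(dim=1).item()  # index of the greedy action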

DQN Algorithm

The loss function

  • The loss function usually used in DQN is the Mean Squared Error between the target and the prediction

  • However, we decided to use the Huber loss instead, because it is less sensitive to outliers

\[\mathcal{L}(\delta(\theta)) = \mathcal{L}^{\text{Huber}}(\delta(\theta)) + \mathcal{L}^{L_2}(\delta(\theta))\]

\[\mathcal{L}^{\text{Huber}}(\delta) = \frac{1}{|B|}\sum_{i=1}^{|B|} l_i\]

where, for the TD error \(\delta_i\) of the \(i\)-th transition in the batch \(B\),

\[ l_i =\begin{cases} \frac{1}{2} \delta_i^2 & \text{if } |\delta_i| \leq 1\\ \vert\delta_i\vert - \frac{1}{2} & \text{otherwise} \end{cases} \]
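For reference, PyTorch provides exactly this piecewise loss (a sketch; q_pred and q_target stand for the predicted and target Q-values of a batch):

import torch.nn as nn

criterion = nn.SmoothL1Loss()       # Huber loss with threshold 1, averaged over the batch
loss = criterion(q_pred, q_target)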

The \(\epsilon\)-greedy policy

The \(\epsilon\)-greedy policy is a way to balance the exploration and exploitation of the agent: with probability \(\epsilon\) a random action is taken, otherwise the greedy one.

Usually, \(\epsilon\) is high at the beginning of the training and is then decreased over time.

How to decrease it? We considered three options (sketched in code after the list):

  1. Lai and Robbins: \(\epsilon(t(a)) = \frac{\eta}{1+\nu t(a)}\)
  2. Entropy regularization: \(\beta(a) = \alpha\log{(t(a)+1)}\)
  3. Torch-version : \(\epsilon = \epsilon_{end}+(\epsilon_{start}-\epsilon_{end})\exp{(-\frac{t}{\epsilon_{decay}})}\)
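The three schedules as plain functions of the step counter (a sketch; all parameter values are illustrative, not the ones used in training):

import numpy as np

def lai_robbins(t, eta=1.0, nu=0.01):
    # 1. Lai and Robbins: hyperbolic decay in the action counter t(a)
    return eta / (1 + nu * t)

def entropy_regularization(t, alpha=0.5):
    # 2. Exploration bonus growing logarithmically with the action counter t(a)
    return alpha * np.log(t + 1)

def torch_style(t, eps_start=0.9, eps_end=0.05, eps_decay=1000):
    # 3. Exponential decay used in the PyTorch DQN tutorial
    return eps_end + (eps_start - eps_end) * np.exp(-t / eps_decay)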

Results

Decay strategies

Lai and Robbins

Entropy regularization

Torch-version

Reward shaping

Train performance

Training a DQN agent for 600 episodes, with progressive log-rewards and entropy regularization, we get:

Test performance

We set the performance of a random agent as the baseline. Our trained DQN agent shows a significant improvement (even though it was still not able to reach the 2048 tile):

Thank you for your attention!